On January 12, 2010, a magnitude 7.0 earthquake struck Haiti causing significant damage which affected approximately 3 million citizens. In the wake of the disaster, aid groups were working to locate displaced persons and provide them with food and water. However, due to the large scale destruction of infrastructure over a wide area additional assistance was needed to locate people quickly.
Little is left of a neighborhood on a hillside near downtown Port-au-Prince on Jan. 15. More than a million people were displaced by the quake. (David Gilkey/NPR)
Displaced persons were known to be making make-shift shelters out of blue tarps. High resolution geo-refereneced images were captured by aircraft of the destroyed areas. The data generated by the image collection was too large for aid workers to process in time to supply aid. Therefore, a team from the Rochester Institute of Technology used data-mining algorithms to analyze the images and identify blue tarps. The goal was to effectively locate displaced persons and communicate their location to rescue workers so they could get resources to people who needed it in time.
Sample image of a geo-referenced image used for the analysis
As the final project for SYS 6018 - Data Mining, we were assigned to build models from the different techniques we learned in the course to, as accurately as possible, and in as timely a manner as possible, locate as many of the displaced persons identified in the imagery data so that they could be provided food and water before their situations became unsurvivable. The data made available to students consisted of a csv of red, green, blue pixel values and a class indicator which indicated if a pixel was representative of a blue tarp or something else like vegetation. A final hold-out data set presented in the format of multiple text files was provided as well.
The US Government spent $1.5B on Haiti disaster relief by the end of 2010.
The data provided for analysis was generated from overhead images and stored as a three channel output. Each pixel also had a classifier label indicating whether it was a blue tarp or something else like vegetation or soil. The channels represented the red, green, and blue values for pixels within images. RGB color model is referred to as an additive model. The integer value for the red, green, and blue channels are combined to represent a color. Typically, the component values are stored as an 8 bit integer ranging from 0 to 255.
df <- tibble(read.csv("HaitiPixels.csv")) #read in df
"Check for NA values"
anyNA(df) #check for NA values
"Summary of Data"
summary(df) #quick look at data
df$Class <- factor(df$Class) #make Class a factor variable.
#> [1] "Check for NA values"
#> [1] FALSE
#> [1] "Summary of Data"
#> Class Red Green Blue
#> Length:63241 Min. : 48 Min. : 48.0 Min. : 44.0
#> Class :character 1st Qu.: 80 1st Qu.: 78.0 1st Qu.: 63.0
#> Mode :character Median :163 Median :148.0 Median :123.0
#> Mean :163 Mean :153.7 Mean :125.1
#> 3rd Qu.:255 3rd Qu.:226.0 3rd Qu.:181.0
#> Max. :255 Max. :255.0 Max. :255.0
#Reference [1]
# The palette with grey:
cbPalette <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
# To use for fills, add
#scale_fill_manual(values=cbPalette)
ggpairs(df[,2:4], lower.panel = NULL, upper = list(continuous = wrap("cor", size = 3)), aes(color=df$Class))# + scale_fill_manual(values=cbPalette)
#view scatter and correlations
attach(df) #attach df variables
fig <- plot_ly(df, x=~Red, y=~Green, z=~Blue, color=~Class) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig
!!!!!!!!!! IF I HAVE TIME MAKE A SELECTOR TO CHOOSE COLOR SCHEME FOR NOT COLOR BLIND OR DIFFERENT KINDS OF COLOR BLIND https://socviz.co/refineplots.html
Initial inspection of the data frame indicated no missing values. The data provided is sufficiently cleaned only one further adjustment to the data frame is needed. Since our main interest is to predict whether a pixel represents a blue tarp or not a blue tarp, the Class column of the data frame needs to be converted into a binary indicator for blue tarp or not blue tarp. This is done in the next section.
The data was visualized with the ggpairs function. For a pair of variables chosen from the data frame, Ggpairs generates a scatterplot, displays a Pearson correlation, and, on the diagonal, shows a variable distribution. The plots were also color-coded by class. The class label describes what kind of object a pixel is associated with. In our data frame there were the following classes: Blue Tarp, Rooftop, Soil, Various Non-tarp, and Vegetation. The 2D representation of the data only gives us a partial insight into the behavior and relationships of the predictors. Since three channels are used to generate a color, plotting the data in 3D to investigate trends and behavior between classes will be more meaningful.
The 3D scatter plot shows a significant amount of overlap between the different classes. It is worth noting that it is possible to see some separation between the classes.
df <- cbind(mutate(df, "Blue_Tarp_or_Not"=ifelse(Class != "Blue Tarp", 0, 1))) #add binary column indicating whether the Class variable is "Blue Tarp" or not
attach(df)
df$Blue_Tarp_or_Not <- factor(Blue_Tarp_or_Not, labels = c("NBT", "BT"))#, levels =c(0,1), labels = c("NBT", "BT")) #ensure new column is a factor
"First Six Rows of Data Frame"
head(df)
df_factor <- df[, -1]
"Last Six Rows of Data Frame"
tail(df_factor)
attach(df_factor)
#> [1] "First Six Rows of Data Frame"
#> Class Red Green Blue Blue_Tarp_or_Not
#> 1 Vegetation 64 67 50 NBT
#> 2 Vegetation 64 67 50 NBT
#> 3 Vegetation 64 66 49 NBT
#> 4 Vegetation 75 82 53 NBT
#> 5 Vegetation 74 82 54 NBT
#> 6 Vegetation 72 76 52 NBT
#> [1] "Last Six Rows of Data Frame"
#> Red Green Blue Blue_Tarp_or_Not
#> 63236 136 145 150 BT
#> 63237 138 146 150 BT
#> 63238 134 141 152 BT
#> 63239 136 143 151 BT
#> 63240 132 139 149 BT
#> 63241 133 141 153 BT
fig1 <- plot_ly(df_factor, x=~Red, y=~Green, z=~Blue, color=~Blue_Tarp_or_Not) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig1 <- fig1 %>% add_markers()
fig1 <- fig1 %>% layout(scene=list(xaxis=list(title="Red"),
yaxis = list(title = 'Green'),
zaxis = list(title = 'Blue')))
fig1
After the class label is converted into a binary classifier, it is easier to see separation between the data points for blue tarps and not blue tarps.
#In order to make run times faster when tuning parameters subset data with 20%
trainIndex <- createDataPartition(df_factor$Blue_Tarp_or_Not, p=0.2,
list=FALSE,
times=1)
df_subset <- df_factor[trainIndex,]
Fit a Logistic Regression Model !!!Need to turn on the fold result saving …
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
glm.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,
method="glm",
family="binomial",
trControl= fitControl)
glm.fit
"Summary"
summary(glm.fit)
#> Generalized Linear Model
#>
#> 12649 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> No pre-processing
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9951774 0.9182797
#>
#> [1] "Summary"
#>
#> Call:
#> NULL
#>
#> Deviance Residuals:
#> Min 1Q Median 3Q Max
#> -2.93552 -0.01440 -0.00079 0.00000 3.10630
#>
#> Coefficients:
#> Estimate Std. Error z value Pr(>|z|)
#> (Intercept) 0.54001 0.44609 1.211 0.226
#> Red -0.28753 0.03240 -8.875 < 2e-16 ***
#> Green -0.24208 0.03346 -7.234 4.68e-13 ***
#> Blue 0.52225 0.04078 12.807 < 2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> (Dispersion parameter for binomial family taken to be 1)
#>
#> Null deviance: 3584.46 on 12648 degrees of freedom
#> Residual deviance: 320.94 on 12645 degrees of freedom
#> AIC: 328.94
#>
#> Number of Fisher Scoring iterations: 12
Test model performance on Train data to select threshold values…
#pass
glm.prob <- predict(glm.fit, newdata=df_subset , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
glm_roc <- roc(df_subset $Blue_Tarp_or_Not, glm.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="GLM ROC Curve")
roc.info_glm <- roc(df_subset$Blue_Tarp_or_Not, glm.prob[,2], legacy.axes=TRUE)
roc.glm.df <- data.frame(tpp=roc.info_glm$sensitivities*100, fpp=(1-roc.info_glm$specificities)*100, thresholds=roc.info_glm$thresholds)
#roc.glm.df[roc.glm.df>98.5 & roc.glm.df < 99,]
glm.thresholds <- data.matrix(roc.glm.df$thresholds)
fig2 <- plot_ly(roc.glm.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig2 <- fig2 %>% add_markers()
fig2 <- fig2 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig2
lr.thresh <- 0.5
glm.pred_thresh <- factor(ifelse(glm.prob[,2]>lr.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.glm_thresh <- confusionMatrix(factor(glm.pred_thresh),df_subset $Blue_Tarp_or_Not, positive = "BT")
"Threshold: 0.5"
cm.glm_thresh
acc_LR <- cm.glm_thresh[["overall"]][["Accuracy"]]*100
auc_LR <- glm_roc[["auc"]]
thresh_LR <- lr.thresh
sens_LR <- cm.glm_thresh[["byClass"]][["Sensitivity"]]*100
spec_LR <- cm.glm_thresh[["byClass"]][["Specificity"]]*100
FDR_LR <- ((cm.glm_thresh[["table"]][2,1])/(cm.glm_thresh[["table"]][2,1]+cm.glm_thresh[["table"]][2,2]))*100
prec_LR <- cm.glm_thresh[["byClass"]][["Precision"]]*100
#> [1] "Threshold: 0.5"
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 12230 46
#> BT 14 359
#>
#> Accuracy : 0.9953
#> 95% CI : (0.9939, 0.9964)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9204
#>
#> Mcnemar's Test P-Value : 6.279e-05
#>
#> Sensitivity : 0.88642
#> Specificity : 0.99886
#> Pos Pred Value : 0.96247
#> Neg Pred Value : 0.99625
#> Prevalence : 0.03202
#> Detection Rate : 0.02838
#> Detection Prevalence : 0.02949
#> Balanced Accuracy : 0.94264
#>
#> 'Positive' Class : BT
#>
"10 Fold Results"
glm.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
glm.sd <- sd(glm.fit[["resample"]][["Accuracy"]]*100)
"Standard Deviation of 10 Folds Accuracy"
glm.sd
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9952569 0.9186146 none Fold01
#> 2 0.9944620 0.9038261 none Fold02
#> 3 0.9905138 0.8372293 none Fold03
#> 4 0.9976303 0.9626306 none Fold04
#> 5 0.9936809 0.8915004 none Fold05
#> 6 0.9960474 0.9362303 none Fold06
#> 7 0.9952569 0.9206366 none Fold07
#> 8 0.9976266 0.9617372 none Fold08
#> 9 0.9936709 0.8886540 none Fold09
#> 10 0.9976285 0.9617382 none Fold10
#> [1] "Standard Deviation of 10 Folds Accuracy"
#> [1] 0.2249998
The average accuracy across ten folds is 99.53 with a standard deviation of 0.225.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
lda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,,
preProcess=c("center","scale"),
method="lda",
verbose= FALSE,
trControl= fitControl)
lda.fit
"Summary"
summary(lda.fit)
"10 Fold Results"
lda.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
#> Linear Discriminant Analysis
#>
#> 12649 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9840305 0.7569777
#>
#> [1] "Summary"
#> Length Class Mode
#> prior 2 -none- numeric
#> counts 2 -none- numeric
#> means 6 -none- numeric
#> scaling 3 -none- numeric
#> lev 2 -none- character
#> svd 1 -none- numeric
#> N 1 -none- numeric
#> call 4 -none- call
#> xNames 3 -none- character
#> problemType 1 -none- character
#> tuneValue 1 data.frame list
#> obsLevels 2 -none- character
#> param 1 -none- list
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9865613 0.7778753 none Fold01
#> 2 0.9833861 0.7555261 none Fold02
#> 3 0.9802372 0.6885924 none Fold03
#> 4 0.9842022 0.7791694 none Fold04
#> 5 0.9849921 0.7576814 none Fold05
#> 6 0.9849802 0.7787831 none Fold06
#> 7 0.9833992 0.7384073 none Fold07
#> 8 0.9889241 0.8352205 none Fold08
#> 9 0.9825949 0.7291147 none Fold09
#> 10 0.9810277 0.7294069 none Fold10
#pass
lda.prob <- predict(lda.fit, newdata=df_subset, type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
lda_roc <- roc(df_subset$Blue_Tarp_or_Not, lda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="LDA ROC Curve")
roc.info_lda <- roc(df_subset $Blue_Tarp_or_Not, lda.prob[,2], legacy.axes=TRUE)
roc.lda.df <- data.frame(tpp=roc.info_lda$sensitivities*100, fpp=(1-roc.info_lda$specificities)*100, thresholds=roc.info_lda$thresholds)
#roc.lda.df[roc.lda.df>91.5 & roc.lda.df < 91.6,]
fig3 <- plot_ly(roc.lda.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig3 <- fig3 %>% add_markers()
fig3 <- fig3 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig3
lda.thresh <- 0.5
lda.pred_thresh <- factor(ifelse(lda.prob[,2]>lda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.lda_thresh <- confusionMatrix(factor(lda.pred_thresh),df_subset$Blue_Tarp_or_Not, positive = "BT")
"Threshold: 0.5"
cm.lda_thresh
acc_lda <- cm.lda_thresh[["overall"]][["Accuracy"]]*100
auc_lda <- lda_roc[["auc"]]
thresh_lda <- lr.thresh
sens_lda <- cm.lda_thresh[["byClass"]][["Sensitivity"]]*100
spec_lda <- cm.lda_thresh[["byClass"]][["Specificity"]]*100
FDR_lda <- ((cm.lda_thresh[["table"]][2,1])/(cm.lda_thresh[["table"]][2,1]+cm.lda_thresh[["table"]][2,2]))*100
prec_lda <- cm.lda_thresh[["byClass"]][["Precision"]]*100
#> [1] "Threshold: 0.5"
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 12117 76
#> BT 127 329
#>
#> Accuracy : 0.984
#> 95% CI : (0.9816, 0.9861)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.756
#>
#> Mcnemar's Test P-Value : 0.0004493
#>
#> Sensitivity : 0.81235
#> Specificity : 0.98963
#> Pos Pred Value : 0.72149
#> Neg Pred Value : 0.99377
#> Prevalence : 0.03202
#> Detection Rate : 0.02601
#> Detection Prevalence : 0.03605
#> Balanced Accuracy : 0.90099
#>
#> 'Positive' Class : BT
#>
"10 Fold Results"
lda.fit$resample
lda.sd <- sd(lda.fit[["resample"]][["Accuracy"]]*100)
"Standard Deviation of 10 Folds Accuracy"
lda.sd
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9865613 0.7778753 none Fold01
#> 2 0.9833861 0.7555261 none Fold02
#> 3 0.9802372 0.6885924 none Fold03
#> 4 0.9842022 0.7791694 none Fold04
#> 5 0.9849921 0.7576814 none Fold05
#> 6 0.9849802 0.7787831 none Fold06
#> 7 0.9833992 0.7384073 none Fold07
#> 8 0.9889241 0.8352205 none Fold08
#> 9 0.9825949 0.7291147 none Fold09
#> 10 0.9810277 0.7294069 none Fold10
#> [1] "Standard Deviation of 10 Folds Accuracy"
#> [1] 0.2549268
The average accuracy across ten folds is 98.4 with a standard deviation of 0.255.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
qda.fit <- caret::train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,,
preProcess=c("center","scale"),
method="qda",
verbose= FALSE,
trControl= fitControl)
qda.fit
"Summary"
summary(qda.fit)
"10 Fold Results"
qda.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
#> Quadratic Discriminant Analysis
#>
#> 12649 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ...
#> Resampling results:
#>
#> Accuracy Kappa
#> 0.9950986 0.9127109
#>
#> [1] "Summary"
#> Length Class Mode
#> prior 2 -none- numeric
#> counts 2 -none- numeric
#> means 6 -none- numeric
#> scaling 18 -none- numeric
#> ldet 2 -none- numeric
#> lev 2 -none- character
#> N 1 -none- numeric
#> call 4 -none- call
#> xNames 3 -none- character
#> problemType 1 -none- character
#> tuneValue 1 data.frame list
#> obsLevels 2 -none- character
#> param 1 -none- list
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9944664 0.9012876 none Fold01
#> 2 0.9936709 0.8856729 none Fold02
#> 3 0.9905138 0.8188112 none Fold03
#> 4 0.9984202 0.9747934 none Fold04
#> 5 0.9921011 0.8571267 none Fold05
#> 6 0.9952569 0.9206366 none Fold06
#> 7 0.9960474 0.9330355 none Fold07
#> 8 1.0000000 1.0000000 none Fold08
#> 9 0.9936709 0.8886540 none Fold09
#> 10 0.9968379 0.9470911 none Fold10
#pass
qda.prob <- predict(qda.fit, newdata=df_subset , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
qda_roc <- roc(df_subset $Blue_Tarp_or_Not, qda.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="QDA ROC Curve")
roc.info_qda <- roc(df_subset$Blue_Tarp_or_Not, qda.prob[,2], legacy.axes=TRUE)
roc.qda.df <- data.frame(tpp=roc.info_qda$sensitivities*100, fpp=(1-roc.info_qda$specificities)*100, thresholds=roc.info_qda$thresholds)
#roc.qda.df[roc.qda.df>98 & roc.qda.df < 99,]
fig4 <- plot_ly(roc.qda.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig4 <- fig4 %>% add_markers()
fig4 <- fig4 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig4
qda.thresh <- 0.5
qda.pred_thresh <- factor(ifelse(qda.prob[,2]>qda.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.qda_thresh <- confusionMatrix(factor(qda.pred_thresh),df_subset $Blue_Tarp_or_Not, positive = "BT")
"Threshold: 0.5"
cm.qda_thresh
acc_qda <- cm.qda_thresh[["overall"]][["Accuracy"]]*100
auc_qda <- qda_roc[["auc"]]
thresh_qda <- lr.thresh
sens_qda <- cm.qda_thresh[["byClass"]][["Sensitivity"]]*100
spec_qda <- cm.qda_thresh[["byClass"]][["Specificity"]]*100
FDR_qda <- ((cm.qda_thresh[["table"]][2,1])/(cm.qda_thresh[["table"]][2,1]+cm.qda_thresh[["table"]][2,2]))*100
prec_qda <- cm.qda_thresh[["byClass"]][["Precision"]]*100
#> [1] "Threshold: 0.5"
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 12241 60
#> BT 3 345
#>
#> Accuracy : 0.995
#> 95% CI : (0.9936, 0.9962)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9138
#>
#> Mcnemar's Test P-Value : 1.722e-12
#>
#> Sensitivity : 0.85185
#> Specificity : 0.99975
#> Pos Pred Value : 0.99138
#> Neg Pred Value : 0.99512
#> Prevalence : 0.03202
#> Detection Rate : 0.02727
#> Detection Prevalence : 0.02751
#> Balanced Accuracy : 0.92580
#>
#> 'Positive' Class : BT
#>
"10 Fold Results"
qda.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
qda.sd <- sd(qda.fit[["resample"]][["Accuracy"]]*100)
"Standard Deviation of 10 Folds Accuracy"
qda.sd
#> [1] "10 Fold Results"
#> Accuracy Kappa parameter Resample
#> 1 0.9944664 0.9012876 none Fold01
#> 2 0.9936709 0.8856729 none Fold02
#> 3 0.9905138 0.8188112 none Fold03
#> 4 0.9984202 0.9747934 none Fold04
#> 5 0.9921011 0.8571267 none Fold05
#> 6 0.9952569 0.9206366 none Fold06
#> 7 0.9960474 0.9330355 none Fold07
#> 8 1.0000000 1.0000000 none Fold08
#> 9 0.9936709 0.8886540 none Fold09
#> 10 0.9968379 0.9470911 none Fold10
#> [1] "Standard Deviation of 10 Folds Accuracy"
#> [1] 0.2857524
The average accuracy across ten folds is 99.5 with a standard deviation of 0.286.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
knn.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,,
preProcess=c("center","scale"),
method="knn",
trControl= fitControl,
tuneLength=5
)
knn.fit
"Summary"
summary(knn.fit)
"10 Fold Results"
knn.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
#> k-Nearest Neighbors
#>
#> 12649 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ...
#> Resampling results across tuning parameters:
#>
#> k Accuracy Kappa
#> 5 0.9968377 0.9492194
#> 7 0.9969960 0.9516062
#> 9 0.9971540 0.9537382
#> 11 0.9971538 0.9537780
#> 13 0.9969957 0.9512617
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was k = 9.
#> [1] "Summary"
#> Length Class Mode
#> learn 2 -none- list
#> k 1 -none- numeric
#> theDots 0 -none- list
#> xNames 3 -none- character
#> problemType 1 -none- character
#> tuneValue 1 data.frame list
#> obsLevels 2 -none- character
#> param 0 -none- list
#> [1] "10 Fold Results"
#> Accuracy Kappa k Resample
#> 1 0.9952569 0.9261242 5 Fold01
#> 2 0.9952569 0.9261242 7 Fold01
#> 3 0.9968379 0.9507495 9 Fold01
#> 4 0.9968379 0.9507495 11 Fold01
#> 5 0.9968379 0.9507495 13 Fold01
#> 6 0.9968354 0.9470864 5 Fold02
#> 7 0.9976266 0.9608005 7 Fold02
#> 8 0.9976266 0.9608005 9 Fold02
#> 9 0.9968354 0.9470864 11 Fold02
#> 10 0.9960443 0.9346675 13 Fold02
#> 11 0.9944664 0.9085369 5 Fold03
#> 12 0.9936759 0.8941755 7 Fold03
#> 13 0.9928854 0.8794537 9 Fold03
#> 14 0.9936759 0.8941755 11 Fold03
#> 15 0.9936759 0.8941755 13 Fold03
#> 16 0.9976303 0.9634815 5 Fold04
#> 17 0.9968404 0.9518567 7 Fold04
#> 18 0.9984202 0.9753740 9 Fold04
#> 19 0.9984202 0.9753740 11 Fold04
#> 20 0.9976303 0.9626306 13 Fold04
#> 21 0.9976303 0.9617392 5 Fold05
#> 22 0.9968404 0.9483697 7 Fold05
#> 23 0.9968404 0.9483697 9 Fold05
#> 24 0.9968404 0.9483697 11 Fold05
#> 25 0.9968404 0.9483697 13 Fold05
#> 26 0.9960474 0.9377160 5 Fold06
#> 27 0.9968379 0.9495855 7 Fold06
#> 28 0.9968379 0.9495855 9 Fold06
#> 29 0.9968379 0.9495855 11 Fold06
#> 30 0.9968379 0.9495855 13 Fold06
#> 31 0.9968379 0.9483684 5 Fold07
#> 32 0.9984190 0.9747928 7 Fold07
#> 33 0.9976285 0.9617382 9 Fold07
#> 34 0.9984190 0.9747928 11 Fold07
#> 35 0.9992095 0.9875432 13 Fold07
#> 36 0.9968354 0.9495852 5 Fold08
#> 37 0.9984177 0.9741830 7 Fold08
#> 38 0.9968354 0.9495852 9 Fold08
#> 39 0.9976266 0.9617372 11 Fold08
#> 40 0.9976266 0.9617372 13 Fold08
#> 41 0.9984177 0.9741830 5 Fold09
#> 42 0.9976266 0.9608005 7 Fold09
#> 43 0.9984177 0.9741830 9 Fold09
#> 44 0.9968354 0.9483660 11 Fold09
#> 45 0.9968354 0.9483660 13 Fold09
#> 46 0.9984190 0.9753733 5 Fold10
#> 47 0.9984190 0.9753733 7 Fold10
#> 48 0.9992095 0.9875432 9 Fold10
#> 49 0.9992095 0.9875432 11 Fold10
#> 50 0.9984190 0.9747928 13 Fold10
#pass
knn.prob <- predict(knn.fit, newdata=df_subset , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
knn_roc <- roc(df_subset $Blue_Tarp_or_Not, knn.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="KNN ROC Curve")
Not sure why there are only 10 values for this one…?
roc.info_knn <- roc(df_subset$Blue_Tarp_or_Not, knn.prob[,2], legacy.axes=TRUE)
roc.knn.df <- data.frame(tpp=roc.info_knn$sensitivities*100, fpp=(1-roc.info_knn$specificities)*100, thresholds=roc.info_knn$thresholds)
#roc.knn.df[roc.knn.df>99 & roc.knn.df < 100,]
#roc.knn.df
fig5 <- plot_ly(roc.knn.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig5 <- fig5 %>% add_markers()
fig5 <- fig5 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig5
knn.thresh <- 0.5
knn.pred_thresh <- factor(ifelse(knn.prob[,2]>knn.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.knn_thresh <- confusionMatrix(factor(knn.pred_thresh),df_subset $Blue_Tarp_or_Not, positive = "BT")
"Threshold: 0.5"
cm.knn_thresh
acc_knn <- cm.knn_thresh[["overall"]][["Accuracy"]]*100
auc_knn <- knn_roc[["auc"]]
thresh_knn <- lr.thresh
sens_knn <- cm.knn_thresh[["byClass"]][["Sensitivity"]]*100
spec_knn <- cm.knn_thresh[["byClass"]][["Specificity"]]*100
FDR_knn <- ((cm.knn_thresh[["table"]][2,1])/(cm.knn_thresh[["table"]][2,1]+cm.knn_thresh[["table"]][2,2]))*100
prec_knn <- cm.knn_thresh[["byClass"]][["Precision"]]*100
k_knn <- knn.fit[["bestTune"]][["k"]]
#> [1] "Threshold: 0.5"
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 12229 15
#> BT 15 390
#>
#> Accuracy : 0.9976
#> 95% CI : (0.9966, 0.9984)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9617
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.96296
#> Specificity : 0.99877
#> Pos Pred Value : 0.96296
#> Neg Pred Value : 0.99877
#> Prevalence : 0.03202
#> Detection Rate : 0.03083
#> Detection Prevalence : 0.03202
#> Balanced Accuracy : 0.98087
#>
#> 'Positive' Class : BT
#>
"10 Fold Results"
knn.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
knn.sd <- sd(knn.fit[["resample"]][["Accuracy"]]*100)
"Standard Deviation of 10 Folds Accuracy"
knn.sd
#> [1] "10 Fold Results"
#> Accuracy Kappa k Resample
#> 1 0.9952569 0.9261242 5 Fold01
#> 2 0.9952569 0.9261242 7 Fold01
#> 3 0.9968379 0.9507495 9 Fold01
#> 4 0.9968379 0.9507495 11 Fold01
#> 5 0.9968379 0.9507495 13 Fold01
#> 6 0.9968354 0.9470864 5 Fold02
#> 7 0.9976266 0.9608005 7 Fold02
#> 8 0.9976266 0.9608005 9 Fold02
#> 9 0.9968354 0.9470864 11 Fold02
#> 10 0.9960443 0.9346675 13 Fold02
#> 11 0.9944664 0.9085369 5 Fold03
#> 12 0.9936759 0.8941755 7 Fold03
#> 13 0.9928854 0.8794537 9 Fold03
#> 14 0.9936759 0.8941755 11 Fold03
#> 15 0.9936759 0.8941755 13 Fold03
#> 16 0.9976303 0.9634815 5 Fold04
#> 17 0.9968404 0.9518567 7 Fold04
#> 18 0.9984202 0.9753740 9 Fold04
#> 19 0.9984202 0.9753740 11 Fold04
#> 20 0.9976303 0.9626306 13 Fold04
#> 21 0.9976303 0.9617392 5 Fold05
#> 22 0.9968404 0.9483697 7 Fold05
#> 23 0.9968404 0.9483697 9 Fold05
#> 24 0.9968404 0.9483697 11 Fold05
#> 25 0.9968404 0.9483697 13 Fold05
#> 26 0.9960474 0.9377160 5 Fold06
#> 27 0.9968379 0.9495855 7 Fold06
#> 28 0.9968379 0.9495855 9 Fold06
#> 29 0.9968379 0.9495855 11 Fold06
#> 30 0.9968379 0.9495855 13 Fold06
#> 31 0.9968379 0.9483684 5 Fold07
#> 32 0.9984190 0.9747928 7 Fold07
#> 33 0.9976285 0.9617382 9 Fold07
#> 34 0.9984190 0.9747928 11 Fold07
#> 35 0.9992095 0.9875432 13 Fold07
#> 36 0.9968354 0.9495852 5 Fold08
#> 37 0.9984177 0.9741830 7 Fold08
#> 38 0.9968354 0.9495852 9 Fold08
#> 39 0.9976266 0.9617372 11 Fold08
#> 40 0.9976266 0.9617372 13 Fold08
#> 41 0.9984177 0.9741830 5 Fold09
#> 42 0.9976266 0.9608005 7 Fold09
#> 43 0.9984177 0.9741830 9 Fold09
#> 44 0.9968354 0.9483660 11 Fold09
#> 45 0.9968354 0.9483660 13 Fold09
#> 46 0.9984190 0.9753733 5 Fold10
#> 47 0.9984190 0.9753733 7 Fold10
#> 48 0.9992095 0.9875432 9 Fold10
#> 49 0.9992095 0.9875432 11 Fold10
#> 50 0.9984190 0.9747928 13 Fold10
#> [1] "Standard Deviation of 10 Folds Accuracy"
#> [1] 0.1451141
The average accuracy across ten folds is 99.76 with a standard deviation of 0.145.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
rf.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,,
preProcess=c("center","scale"),
method="rf", #what is the difference between the different caret rf models??
trControl= fitControl,
tuneLength=3
)
rf.fit
"Summary"
summary(rf.fit)
"10 Fold Results"
rf.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
#> note: only 2 unique complexity parameters in default grid. Truncating the grid to 2 .
#>
#> Random Forest
#>
#> 12649 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ...
#> Resampling results across tuning parameters:
#>
#> mtry Accuracy Kappa
#> 2 0.9969956 0.9507627
#> 3 0.9967583 0.9473050
#>
#> Accuracy was used to select the optimal model using the largest value.
#> The final value used for the model was mtry = 2.
#> [1] "Summary"
#> Length Class Mode
#> call 4 -none- call
#> type 1 -none- character
#> predicted 12649 factor numeric
#> err.rate 1500 -none- numeric
#> confusion 6 -none- numeric
#> votes 25298 matrix numeric
#> oob.times 12649 -none- numeric
#> classes 2 -none- character
#> importance 3 -none- numeric
#> importanceSD 0 -none- NULL
#> localImportance 0 -none- NULL
#> proximity 0 -none- NULL
#> ntree 1 -none- numeric
#> mtry 1 -none- numeric
#> forest 14 -none- list
#> y 12649 factor numeric
#> test 0 -none- NULL
#> inbag 0 -none- NULL
#> xNames 3 -none- character
#> problemType 1 -none- character
#> tuneValue 1 data.frame list
#> obsLevels 2 -none- character
#> param 0 -none- list
#> [1] "10 Fold Results"
#> Accuracy Kappa mtry Resample
#> 1 0.9952569 0.9243798 2 Fold01
#> 2 0.9944664 0.9128059 3 Fold01
#> 3 0.9968354 0.9470864 2 Fold02
#> 4 0.9960443 0.9346675 3 Fold02
#> 5 0.9936759 0.8941755 2 Fold03
#> 6 0.9936759 0.8941755 3 Fold03
#> 7 0.9984202 0.9753740 2 Fold04
#> 8 0.9984202 0.9753740 3 Fold04
#> 9 0.9976303 0.9617392 2 Fold05
#> 10 0.9976303 0.9617392 3 Fold05
#> 11 0.9968379 0.9483684 2 Fold06
#> 12 0.9968379 0.9495855 3 Fold06
#> 13 0.9976285 0.9617382 2 Fold07
#> 14 0.9984190 0.9747928 3 Fold07
#> 15 0.9976266 0.9617372 2 Fold08
#> 16 0.9976266 0.9617372 3 Fold08
#> 17 0.9960443 0.9330281 2 Fold09
#> 18 0.9952532 0.9206296 3 Fold09
#> 19 1.0000000 1.0000000 2 Fold10
#> 20 0.9992095 0.9875432 3 Fold10
#pass
RF.prob <- predict(rf.fit, newdata=df_subset , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
RF_roc <- roc(df_subset $Blue_Tarp_or_Not, RF.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="RF ROC Curve")
roc.info_rf <- roc(df_subset$Blue_Tarp_or_Not, RF.prob[,2], legacy.axes=TRUE)
roc.rf.df <- data.frame(tpp=roc.info_rf$sensitivities*100, fpp=(1-roc.info_rf$specificities)*100, thresholds=roc.info_rf$thresholds)
#roc.rf.df[roc.rf.df>99 & roc.rf.df < 100,]
#roc.rf.df
fig6 <- plot_ly(roc.rf.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig6 <- fig6 %>% add_markers()
fig6 <- fig6 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig6
RF.thresh <- 0.5
RF.pred_thresh <- factor(ifelse(RF.prob[,2]>RF.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.RF_thresh <- confusionMatrix(factor(RF.pred_thresh),df_subset $Blue_Tarp_or_Not, positive = "BT")
"Threshold: 0.5"
cm.RF_thresh
acc_RF <- cm.RF_thresh[["overall"]][["Accuracy"]]*100
auc_RF <- RF_roc[["auc"]]
thresh_RF <- lr.thresh
sens_RF <- cm.RF_thresh[["byClass"]][["Sensitivity"]]*100
spec_RF <- cm.RF_thresh[["byClass"]][["Specificity"]]*100
FDR_RF <- ((cm.RF_thresh[["table"]][2,1])/(cm.RF_thresh[["table"]][2,1]+cm.RF_thresh[["table"]][2,2]))*100
prec_RF <- cm.RF_thresh[["byClass"]][["Precision"]]*100
#> [1] "Threshold: 0.5"
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 12244 7
#> BT 0 398
#>
#> Accuracy : 0.9994
#> 95% CI : (0.9989, 0.9998)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.991
#>
#> Mcnemar's Test P-Value : 0.02334
#>
#> Sensitivity : 0.98272
#> Specificity : 1.00000
#> Pos Pred Value : 1.00000
#> Neg Pred Value : 0.99943
#> Prevalence : 0.03202
#> Detection Rate : 0.03146
#> Detection Prevalence : 0.03146
#> Balanced Accuracy : 0.99136
#>
#> 'Positive' Class : BT
#>
"10 Fold Results"
rf.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
rf.sd <- sd(rf.fit[["resample"]][["Accuracy"]]*100)
"Standard Deviation of 10 Folds Accuracy"
rf.sd
#> [1] "10 Fold Results"
#> Accuracy Kappa mtry Resample
#> 1 0.9952569 0.9243798 2 Fold01
#> 2 0.9944664 0.9128059 3 Fold01
#> 3 0.9968354 0.9470864 2 Fold02
#> 4 0.9960443 0.9346675 3 Fold02
#> 5 0.9936759 0.8941755 2 Fold03
#> 6 0.9936759 0.8941755 3 Fold03
#> 7 0.9984202 0.9753740 2 Fold04
#> 8 0.9984202 0.9753740 3 Fold04
#> 9 0.9976303 0.9617392 2 Fold05
#> 10 0.9976303 0.9617392 3 Fold05
#> 11 0.9968379 0.9483684 2 Fold06
#> 12 0.9968379 0.9495855 3 Fold06
#> 13 0.9976285 0.9617382 2 Fold07
#> 14 0.9984190 0.9747928 3 Fold07
#> 15 0.9976266 0.9617372 2 Fold08
#> 16 0.9976266 0.9617372 3 Fold08
#> 17 0.9960443 0.9330281 2 Fold09
#> 18 0.9952532 0.9206296 3 Fold09
#> 19 1.0000000 1.0000000 2 Fold10
#> 20 0.9992095 0.9875432 3 Fold10
#> [1] "Standard Deviation of 10 Folds Accuracy"
#> [1] 0.1748925
The average accuracy across ten folds is 99.94 with a standard deviation of 0.175.
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
svm.radial.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,,
preProcess=c("center","scale"),
method="svmRadial",
trControl= fitControl,
tuneLength=3
)
svm.radial.fit
"Summary"
summary(svm.radial.fit)
"10 Fold Results"
svm.radial.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
#> Support Vector Machines with Radial Basis Function Kernel
#>
#> 12649 samples
#> 3 predictor
#> 2 classes: 'NBT', 'BT'
#>
#> Pre-processing: centered (3), scaled (3)
#> Resampling: Cross-Validated (10 fold)
#> Summary of sample sizes: 11384, 11385, 11384, 11383, 11383, 11384, ...
#> Resampling results across tuning parameters:
#>
#> C Accuracy Kappa
#> 0.25 0.9963634 0.9406356
#> 0.50 0.9969957 0.9506323
#> 1.00 0.9969957 0.9504009
#>
#> Tuning parameter 'sigma' was held constant at a value of 8.543076
#> Accuracy was used to select the optimal model using the largest value.
#> The final values used for the model were sigma = 8.543076 and C = 0.5.
#> [1] "Summary"
#> Length Class Mode
#> 1 ksvm S4
#> [1] "10 Fold Results"
#> Accuracy Kappa sigma C Resample
#> 1 0.9952569 0.9261242 8.543076 0.25 Fold01
#> 2 0.9960474 0.9362303 8.543076 0.50 Fold01
#> 3 0.9952569 0.9243798 8.543076 1.00 Fold01
#> 4 0.9952532 0.9186126 8.543076 0.25 Fold02
#> 5 0.9960443 0.9330281 8.543076 0.50 Fold02
#> 6 0.9960443 0.9330281 8.543076 1.00 Fold02
#> 7 0.9920949 0.8709184 8.543076 0.25 Fold03
#> 8 0.9928854 0.8794537 8.543076 0.50 Fold03
#> 9 0.9936759 0.8914862 8.543076 1.00 Fold03
#> 10 0.9984202 0.9753740 8.543076 0.25 Fold04
#> 11 0.9992101 0.9875435 8.543076 0.50 Fold04
#> 12 0.9992101 0.9875435 8.543076 1.00 Fold04
#> 13 0.9960506 0.9346736 8.543076 0.25 Fold05
#> 14 0.9968404 0.9483697 8.543076 0.50 Fold05
#> 15 0.9968404 0.9483697 8.543076 1.00 Fold05
#> 16 0.9968379 0.9495855 8.543076 0.25 Fold06
#> 17 0.9968379 0.9495855 8.543076 0.50 Fold06
#> 18 0.9968379 0.9495855 8.543076 1.00 Fold06
#> 19 0.9944664 0.9085407 8.543076 0.25 Fold07
#> 20 0.9968379 0.9495855 8.543076 0.50 Fold07
#> 21 0.9976285 0.9608031 8.543076 1.00 Fold07
#> 22 0.9984177 0.9741830 8.543076 0.25 Fold08
#> 23 0.9984177 0.9741830 8.543076 0.50 Fold08
#> 24 0.9984177 0.9741830 8.543076 1.00 Fold08
#> 25 0.9976266 0.9608005 8.543076 0.25 Fold09
#> 26 0.9976266 0.9608005 8.543076 0.50 Fold09
#> 27 0.9968354 0.9470864 8.543076 1.00 Fold09
#> 28 0.9992095 0.9875432 8.543076 0.25 Fold10
#> 29 0.9992095 0.9875432 8.543076 0.50 Fold10
#> 30 0.9992095 0.9875432 8.543076 1.00 Fold10
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
svm.linear.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor ,,
preProcess=c("center","scale"),
method="svmLinear",
trControl= fitControl,
tuneLength=3
)
svm.linear.fit
#pass
fitControl <- trainControl(method = "cv",
number = 10,
returnResamp = 'all',
savePredictions = 'final',
classProbs = TRUE)
set.seed(4)
svm.poly.fit <- train(Blue_Tarp_or_Not~Red+Green+Blue,
data = df_subset, #df_factor,,
preProcess=c("center","scale"),
method="svmPoly",
trControl= fitControl,
tuneLength=3
)
svm.poly.fit
#pass
SVM.prob <- predict(svm.radial.fit, newdata=df_subset , type = "prob") #returns df with col 0 (prob not blue tarp) and 1 (prob blue tarp)
par(pty="s")
SVM_roc <- roc(df_subset $Blue_Tarp_or_Not, SVM.prob[,2], plot=TRUE, legacy.axes=TRUE, percent=TRUE, xlab="False Positive Percentage", ylab="True Positive Percentage", col="#965fd4", lwd=4, print.auc=TRUE, main="SVM ROC Curve")
roc.info_svm <- roc(df_subset$Blue_Tarp_or_Not, SVM.prob[,2], legacy.axes=TRUE)
roc.svm.df <- data.frame(tpp=roc.info_svm$sensitivities*100, fpp=(1-roc.info_svm$specificities)*100, thresholds=roc.info_svm$thresholds)
#roc.svm.df[roc.svm.df>99 & roc.svm.df < 100,]
#roc.svm.df
fig7 <- plot_ly(roc.svm.df, x=~tpp, y=~fpp, z=~thresholds) #Reference: https://plotly.com/r/3d-scatter-plots/ https://plotly.com/r/3d-surface-plots/
fig7 <- fig7 %>% add_markers()
fig7 <- fig7 %>% layout(scene=list(xaxis=list(title="True Positive Rate"),
yaxis = list(title = 'False Positive Rate'),
zaxis = list(title = 'Threshold')))
fig7
SVM.thresh <- 0.5
SVM.pred_thresh <- factor(ifelse(SVM.prob[,2]>SVM.thresh,"BT", "NBT"), levels=c("NBT", "BT"))
cm.SVM_thresh <- confusionMatrix(factor(SVM.pred_thresh),df_subset $Blue_Tarp_or_Not, positive = "BT")
"Threshold: 0.5"
cm.SVM_thresh
acc_SVM <- cm.SVM_thresh[["overall"]][["Accuracy"]]*100
auc_SVM <- SVM_roc[["auc"]]
thresh_SVM <- lr.thresh
sens_SVM <- cm.SVM_thresh[["byClass"]][["Sensitivity"]]*100
spec_SVM <- cm.SVM_thresh[["byClass"]][["Specificity"]]*100
FDR_SVM <- ((cm.SVM_thresh[["table"]][2,1])/(cm.SVM_thresh[["table"]][2,1]+cm.SVM_thresh[["table"]][2,2]))*100
prec_SVM <- cm.SVM_thresh[["byClass"]][["Precision"]]*100
#> [1] "Threshold: 0.5"
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction NBT BT
#> NBT 12229 20
#> BT 15 385
#>
#> Accuracy : 0.9972
#> 95% CI : (0.9962, 0.9981)
#> No Information Rate : 0.968
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.9551
#>
#> Mcnemar's Test P-Value : 0.499
#>
#> Sensitivity : 0.95062
#> Specificity : 0.99877
#> Pos Pred Value : 0.96250
#> Neg Pred Value : 0.99837
#> Prevalence : 0.03202
#> Detection Rate : 0.03044
#> Detection Prevalence : 0.03162
#> Balanced Accuracy : 0.97470
#>
#> 'Positive' Class : BT
#>
"10 Fold Results"
svm.radial.fit$resample #point est +/- std from 10 folds "variation in the third decimal place"...
svm.sd <- sd(svm.radial.fit[["resample"]][["Accuracy"]]*100)
"Standard Deviation of 10 Folds Accuracy"
svm.sd
#> [1] "10 Fold Results"
#> Accuracy Kappa sigma C Resample
#> 1 0.9952569 0.9261242 8.543076 0.25 Fold01
#> 2 0.9960474 0.9362303 8.543076 0.50 Fold01
#> 3 0.9952569 0.9243798 8.543076 1.00 Fold01
#> 4 0.9952532 0.9186126 8.543076 0.25 Fold02
#> 5 0.9960443 0.9330281 8.543076 0.50 Fold02
#> 6 0.9960443 0.9330281 8.543076 1.00 Fold02
#> 7 0.9920949 0.8709184 8.543076 0.25 Fold03
#> 8 0.9928854 0.8794537 8.543076 0.50 Fold03
#> 9 0.9936759 0.8914862 8.543076 1.00 Fold03
#> 10 0.9984202 0.9753740 8.543076 0.25 Fold04
#> 11 0.9992101 0.9875435 8.543076 0.50 Fold04
#> 12 0.9992101 0.9875435 8.543076 1.00 Fold04
#> 13 0.9960506 0.9346736 8.543076 0.25 Fold05
#> 14 0.9968404 0.9483697 8.543076 0.50 Fold05
#> 15 0.9968404 0.9483697 8.543076 1.00 Fold05
#> 16 0.9968379 0.9495855 8.543076 0.25 Fold06
#> 17 0.9968379 0.9495855 8.543076 0.50 Fold06
#> 18 0.9968379 0.9495855 8.543076 1.00 Fold06
#> 19 0.9944664 0.9085407 8.543076 0.25 Fold07
#> 20 0.9968379 0.9495855 8.543076 0.50 Fold07
#> 21 0.9976285 0.9608031 8.543076 1.00 Fold07
#> 22 0.9984177 0.9741830 8.543076 0.25 Fold08
#> 23 0.9984177 0.9741830 8.543076 0.50 Fold08
#> 24 0.9984177 0.9741830 8.543076 1.00 Fold08
#> 25 0.9976266 0.9608005 8.543076 0.25 Fold09
#> 26 0.9976266 0.9608005 8.543076 0.50 Fold09
#> 27 0.9968354 0.9470864 8.543076 1.00 Fold09
#> 28 0.9992095 0.9875432 8.543076 0.25 Fold10
#> 29 0.9992095 0.9875432 8.543076 0.50 Fold10
#> 30 0.9992095 0.9875432 8.543076 1.00 Fold10
#> [1] "Standard Deviation of 10 Folds Accuracy"
#> [1] 0.1890631
The average accuracy across ten folds is 99.72 with a standard deviation of 0.189.
The final values used for the model were degree = 3, scale = 0.1 and C = 1.
| Method | KNN (k = 9) | LDA | QDA | Log. Regression | Random Forest (tuning param = ?) | SVM (tuning param = ?) |
|---|---|---|---|---|---|---|
| Accuracy | 99.76% | 98.4% | 99.5% | 99.53% | 99.94 | 99.72 |
| AUC | 99.98% | 98.91% | 99.84% | 99.92% | 99.12 | 99.88 |
| ROC | ||||||
| Threshold | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 | 0.5 |
| Sensitivity | 96.3% | 81.23% | 85.19% | 88.64% | 98.27 | 95.06 |
| Specificity | 99.88% | 98.96% | 99.98% | 99.89% | 100 | 99.88 |
| FDR | 3.7% | 27.85% | 0.86% | 3.75% | 0 | 3.75 |
| Precision | 96.3% | 72.15% | 99.14% | 96.25% | 100 | 96.25 |
(discussion on FHO data why we do this… what the benefits are… potential pitfalls)
(discussion somewhere about ROC curves AUC and… other metrics)
#| Method | KNN (k = `r k_knn`) | LDA | QDA | Log. Regression | Random Forest (tuning param = ?) | SVM (tuning param = ?)|
#|-------------------------:|:--------------:|:---------:|:---------:|:---------------:|:--------------------------------:|:---------------------:|
#| Accuracy | `r acc_knn_FHO`% | `r acc_lda_FHO`% | `r acc_qda_FHO`% | `r acc_LR_FHO`% | `r acc_RF_FHO` | `r acc_SVM_FHO` |
#| AUC | `r auc_knn_FHO`% | `r auc_lda_FHO`% | `r auc_qda_FHO`% | `r auc_LR_FHO`% | `r auc_RF_FHO` | `r auc_SVM_FHO` |
#| ROC | | | | | | |
#| Threshold | `r thresh_knn_FHO` | `r thresh_lda_FHO` | `r thresh_qda_FHO` | `r thresh_LR_FHO` | `r thresh_RF_FHO` |`r thresh_SVM_FHO` |
#| Sensitivity=Recall=Power | `r sens_knn_FHO`% | `r sens_lda_FHO`% | `r sens_qda_FHO`% | `r sens_LR_FHO`% |`r sens_RF_FHO` | `r sens_SVM_FHO` |
#| Specificity=1-FPR | `r spec_knn_FHO`% | `r spec_lda_FHO`% | `r spec_qda_FHO`% | `r spec_LR_FHO`% |`r spec_RF_FHO` |`r spec_SVM_FHO` |
#| FDR | `r FDR_knn_FHO`% | `r FDR_lda_FHO`% | `r FDR_qda_FHO`% | `r FDR_LR_FHO`% | `r FDR_RF_FHO` |`r FDR_SVM_FHO` |
#| Precision=PPV | `r prec_knn_FHO`% | `r prec_lda_FHO`% | `r prec_qda_FHO`% | `r prec_LR_FHO`% |`r prec_RF_FHO` | `r prec_SVM_FHO` |
#consider if I was able to find an additional data source like lidar or infrared to pair with this dataset to improve model performance... ?
| LDA | QDA | |
|---|---|---|
| Assumptions | this is a lot of text what happens when you put this much text in this table | |
| Tuning Parameters |